From genes to protein structure and function: 
novel applications of computational 
approaches in the genomic era 

Jeffrey Skolnick and Jacquelyn S. Fetrow 

The genome-sequencing projects are providing a detailed 'parts list' of life. A key to comprehending this list is understanding 
the function of each gene and each protein at various levels. Sequence-based methods for function prediction are inadequate 
because of the multifunctional nature of proteins. However, just knowing the structure of the protein is also insufficient for 
prediction of multiple functional sites. Structural descriptors for protein functional sites are crucial for unlocking the secrets 
in both the sequence and structural-genomics projects. 



Genome-sequencing projects are providing a 
detailed 'parts list' for life. Unfortunately, this list, 
a portion of which represents the amino acid 
sequence of all the proteins in a given genome, does 
not come with an instruction manual. That is, given 
the genome's sequences, one does not necessarily know 
straight away which regions encode proteins, which 
serve a regulatory role and which are responsible for 
the structure and replication of the DNA itself. 

This is not unlike giving a child a list of parts necessary 
to create a working automobile. Without the 
necessary expertise, creating the final, working car from 
just the initial parts list is a nearly impossible task. Similarly, 
understanding how to create a complete, functioning 
cell given just the sequence of nucleotides 
found in an organism's genome is a complex problem. 

What is a protein function? 

After a genome is sequenced and its complete parts 
list determined, the next goal is to understand the function(s) 
of each part, including that of the proteins. What 
do we mean by protein function, the focus of this article? 

Function has many meanings. At one level, the pro- 
tein could be a globular protein, such as an enzyme, 
hormone or antibody or it could be a structural or 
membrane-bound protein. Another level is its bio- 
chemical function, such as the chemical reaction and 
the substrate specificity of an enzyme. The regulatory 
molecules or cofactors that bind to a protein are also 
levels of biochemical function. 

At the cellular level, the protein's function would 
involve its interaction with other macromolecules and 
the function and cellular location of such complexes. 
There is also the protein's physiological function; that 
is, in which metabolic pathway the protein is involved 
or what physiological role it performs in the organism. 
Finally, the phenotypic function is the role played by 
the protein in the total organism, which is observed by 
deleting or mutating the gene encoding the protein. 
Obviously, the complete characterization of protein 
function is difficult but efforts are under way at all levels, 
including cellular function. In this article, however, 
we focus on identifying the biochemical function of a 
protein given its sequence, a problem that is amenable to 
molecular approaches. 

Sequence-based approaches to function 
prediction 

The sequence-to-function approach is the most commonly 
used function-prediction method. This robust 
field is well developed and, in the interest of space, 
we present only a brief overview. 

There are two main flavors of this approach: sequence 
alignment; and sequence-motif methods such as 
Prosite, Blocks, Prints and Emotif. Both the 
alignment and the motif methods are powerful but a 
recent analysis has demonstrated their significant limitations, 
suggesting that these methods will increasingly 
fail as the protein-sequence databases become more 
diverse. 
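
As an illustration of how a sequence-motif method operates, the following minimal sketch translates a PROSITE-style pattern into a regular expression and scans a sequence with it. The pattern C-x(2)-C and the test sequence are illustrative assumptions, not actual database entries, and the helper names are hypothetical.

```python
import re

def prosite_to_regex(pattern: str) -> str:
    """Translate a PROSITE-style pattern into a Python regular expression."""
    parts = []
    for element in pattern.rstrip(".").split("-"):
        element = element.replace("{", "[^").replace("}", "]")  # {P} = any residue except P
        element = element.replace("(", "{").replace(")", "}")    # (2) or (2,4) = repeat counts
        element = element.replace("x", ".")                      # x = any residue
        element = element.replace("<", "^").replace(">", "$")    # terminus anchors
        parts.append(element)
    return "".join(parts)

def find_motif(sequence: str, pattern: str) -> list:
    """Return the start index of every non-overlapping match of the pattern."""
    regex = prosite_to_regex(pattern)
    return [match.start() for match in re.finditer(regex, sequence)]

# Illustrative pattern and made-up sequence (assumptions, not database entries).
hits = find_motif("AAWCGPCKM", "C-x(2)-C")
```

A match of this kind depends only on the linear order of residues, which is why such methods cannot see residues that are distant in sequence but adjacent in the folded structure.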

An extension of these approaches that combines 
protein-sequence with structural information has been 
developed and some successes have been reported. 
However, this method still applies the structural infor- 
mation in a one-dimensional, 'sequence-like' fashion 
and fails to take into account the powerful three- 
dimensional information displayed by protein structures. 

In addition, proteins can gain and lose function dur- 
ing evolution and may, indeed, have multiple functions 
in the cell (Box 1). Sequence-to-function methods 
cannot specifically identify these complexities. Inaccu- 
rate use of sequence-to-function methods has led to 
significant function-annotation errors in the sequence 
databases. 

An alternative approach 

An alternative, complementary approach to protein-function 
prediction uses the sequence-to-structure-to-function 
paradigm. Here, the goal is to determine the 
structure of the protein of interest and then to identify 
the functionally important residues in that structure. 
Using the chemical structure itself to identify functional 
sites is more in line with how the protein actually works. 
In a sense, this is one long-term goal of 'structural 
genomics' projects, which are designed to determine 
all possible protein folds experimentally, just 
as genome-sequencing projects are determining all 
protein sequences. This is in contrast to traditional 
structural-biology approaches, in which one knows the 
protein's function first and only then, if the function is 
sufficiently important, determines its structure. 

It is implicitly assumed that having the protein's structure 
will provide insights into its function, thereby furthering 
the goals of the human-genome-sequencing 
project. However, knowing a protein's three-dimensional 
structure is insufficient to determine its function 
(Box 2). What we really need to analyse and predict the 
multifunctional aspects of proteins is a method specifically 
to recognize active sites and binding regions in 
these protein structures. 

Active-site identification 
In order to use a structure-based approach to function 
prediction, one must identify the key residues responsible 
for a given biochemical activity. For many years, 
it has been suggested that the active sites in proteins are 
better conserved than the overall fold. Taken to the 
limit, this suggests that one could identify not only distant 
ancestors with the same global fold and the same 
activity but also proteins with similar functions but 
distantly related, or possibly unrelated, global folds. 
The validity of this suggestion was demonstrated 
empirically by Nussinov and co-workers, who showed 
that the active sites of eukaryotic serine proteases, subtilisins 
and sulfhydryl proteases exhibit similar structural 
motifs. Furthermore, in a recent modeling study of 
Saccharomyces cerevisiae proteins, protein functional sites 
were found to be more conserved than other parts of 
the protein models. Similarly, it has been demonstrated 
that the catalytic triad of the α/β hydrolases 
is structurally better conserved than other histidine-containing 
triads. A comparison of the structure of the 
hydrolase catalytic triad to other histidine-containing 
triads shows a distinct bimodal distribution, while a 
similar analysis done with a randomly selected triad shows 
a unimodal distribution (Fig. 1). 

Kasuya and Thornton generalized this example by 
creating structural analogs of a few Prosite sequence 
motifs. For the 20 most frequently occurring Prosite 
patterns, the associated local structure is quite distinct. 
These results provide clear evidence that enzyme active 
sites are indeed more highly conserved than other parts 
of the protein. 

Identifying active sites in experimental structures 

Historically, several groups have attempted to identify 
functional sites in proteins; these efforts were 
directed at protein engineering or building functional 
sites in places where they did not previously exist. This 
has been successfully accomplished for several metal-binding 
sites. However, highly accurate functional-site 
descriptors of the backbone and side-chain atoms were 
required, fueling the belief that significant atomic detail 
is required in site descriptors for function identification. 

Highly detailed residue side-chain descriptors of the 
active sites of serine proteases and related proteins have 
been used to identify functional sites. The use of these 
highly detailed motifs has led to the identification of 



Box 1. Proteins are multifunctional 



A common protein characteristic that makes functional analysis based 
only on homology especially difficult is the tendency of proteins to be 
multifunctional. For instance, lactate dehydrogenase binds NAD, substrate 
and zinc, and performs a redox reaction. Each of these occurs 
at different functional sites that are in close proximity and the combination 
of all four sites creates the fully functional protein. 

Other examples of multifunctional proteins are the nucleic-acid-binding 
proteins. For instance, DNA regulatory proteins often contain a DNA-binding 
domain, a multimerization domain and additional sites that bind 
regulatory proteins; a classic example is RecA. The 3C rhinovirus 
protease exhibits a proteolytic function as well as an RNA-binding 
function. Transcription factors are also complex, multifunctional 
proteins (Ref. 62). It is becoming increasingly important to recognize each of 
these different functions of the gene products of a newly sequenced gene. 

The serine-threonine-phosphatase superfamily is a prime example of 
the difficulties of using standard sequence analysis to recognize the 
multiple functions found in single proteins. This large protein family is 
divided into a number of subfamilies, all of which contain an essential 
phosphatase active site. Subfamilies 1, 2A and 2B exhibit 40% or more 
sequence identity between them. However, each of these subfamilies 
is apparently regulated differently in the cell, an observation that suggests 
that there are different functional sites at which regulation can 
occur. Because the sequence identity between subfamilies is so high, 
standard sequence-similarity methods could easily misclassify new 
sequences as members of the wrong subfamily if the functional sites 
are not carefully considered, as was recently demonstrated. 

These are but a few examples of the multifunctionality of proteins. 
The recognition of this multifunctional nature is of critical importance 
to the genomics field. Useful functional-annotation methods must con- 
sider all of the specific functions in a given protein and will not just 
provide a general classification of function. 



several novel functional sites in known, high-quality 
protein structures. More automated methods for 
finding spatial motifs in protein structures have also 
been described. 

Unfortunately, most of these methods require the 
exact placement of atoms within protein backbones and 
side chains, and so have not been shown to be relevant 
to inexact predicted structures. Recently, however, we 
described the production of fuzzy, inexact descriptors 
of protein functional sites. As we wish to apply the 
descriptors to experimental structures as well as to predicted 
protein models, we used only α-carbon atoms and 
side-chain centers-of-mass positions. We call these 
descriptors 'fuzzy functional forms' (FFFs) and have 
created them for both the disulfide-oxidoreductase 
and α/β-hydrolase catalytic active sites. 
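
The idea of an inexact, distance-based site descriptor can be sketched in a few lines. The example below is only a toy under stated assumptions: the descriptor (two cysteines three positions apart, with a tolerant C-alpha distance window) and all numerical bounds are invented for illustration and are not the published FFF parameters; the function and variable names are hypothetical.

```python
import math

# Hypothetical descriptor: residue identities, their sequence separation and a
# deliberately wide C-alpha distance window. All values are made up.
TOY_DESCRIPTOR = {
    "residues": ("C", "C"),
    "separation": 3,
    "ca_distance_range": (4.5, 6.5),  # angstroms, illustrative bounds only
}

def find_fuzzy_sites(sequence, ca_coords, descriptor):
    """Return start indices where residue types and C-alpha distance fit the descriptor."""
    first, second = descriptor["residues"]
    gap = descriptor["separation"]
    low, high = descriptor["ca_distance_range"]
    hits = []
    for i in range(len(sequence) - gap):
        j = i + gap
        if sequence[i] != first or sequence[j] != second:
            continue  # residue identities do not match the descriptor
        if low <= math.dist(ca_coords[i], ca_coords[j]) <= high:
            hits.append(i)  # geometry falls inside the tolerant window
    return hits
```

Because the geometric test is a window rather than an exact value, a model whose atoms are displaced by a few angstroms can still satisfy it, which is the property that makes such descriptors usable on predicted structures.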

The disulfide-oxidoreductase FFF was applied to 
screen high-resolution structures from the Brookhaven 
protein database. In a dataset of 364 protein structures, 
the FFF accurately identified all proteins known to 
exhibit the disulfide-oxidoreductase active site. In a 
larger dataset of 1501 proteins, the FFF again accurately 
identified all proteins with the active site. In addition, 
it identified another protein, 1fjm, a serine-threonine 
phosphatase. This result was initially discouraging but 
subsequent sequence alignment and clustering analysis 
strongly suggested that this putative site might indeed 
be a site of redox regulation in the serine-threonine 
phosphatase-1 subfamily. If confirmed by experiment, 
this result will highlight the advantages of using structural 
descriptors to analyse multiple functional sites in 
proteins. It will also highlight the fact that human 



Box 2. Knowing a protein's structure does not necessarily 
tell you its function 



Because proteins can have similar folds but different functions, 
determining the structure of a protein may or may not tell you something 
about its function. The best-studied example is the (α/β)8 
barrel enzymes, of which triosephosphate isomerase (TIM) is the archetypal 
representative. Members of this family have similar overall structures 
but different functions, including different active sites, substrate 
specificities and cofactor requirements. 

Is this example common? Our own analysis of the 1997 SCOP database 
shows that the five largest fold families are the ferredoxin-like, 
the (α/β) barrels, the knottins, the immunoglobulin-like and the 
flavodoxin-like fold families with 22, 18, 13, 9 and 9 subfamilies, respectively 
(Fig. i). In fact, 57 of the SCOP fold families consist of multiple 
superfamilies. These data only show the tip of the iceberg, because 
each superfamily is further composed of protein families and each individual 
family can have radically different functions. For example, the 
ferredoxin-like superfamily contains families identified as Fe-S ferredoxins, 
ribosomal proteins, DNA-binding proteins and phosphatases, among 
others. 

After this article was submitted, a much more detailed analysis of the 
SCOP database was published (Ref. 72). This finds a broad function-structure 
correlation for some structural classes, but also finds a number of 
ubiquitous functions and structures that occur across a number of families. 
The article provides a useful analysis of the confidence with which 
structure and function can be correlated (Ref. 72). Knowing the protein structure 
by itself is insufficient to annotate a number of functional classes 
and is also insufficient for annotating the specific details of protein 
function. 

Figure i 

Histogram of the numbers of superfamilies found in each SCOP fold family. 
These data clearly show that proteins with similar structures can have different 
functions and demonstrate the difficulty of assigning protein function based 
simply on the three-dimensional structure. The data were taken from the 1997 
distribution of SCOP (http://scop.mrc-lmb.cam.ac.uk/scop). For a more detailed 
analysis, see Ref. 72. 



observation alone is no longer adequate for identifying 
all functional sites in known protein structures. 

To date, the use of structure to identify function has 
largely focused on high-resolution structures and highly 
detailed descriptors of protein functional sites. How- 
ever, the creation of inexact descriptors for functional 
sites opens the way to the application of these methods 
to inexact, predicted protein models. The question 
remains: how good does a model have to be in order 
to use FFFs to identify its active sites? 



The state of the art in structure-prediction 
methods 

For proteins whose sequence identity to a protein of 
known structure is above ~30%, one can use homology 
modeling to build the structure. However, structure 
prediction is far more difficult 
for proteins that are not homologous to proteins with 
known structure. At present, there are two approaches for 
these sequences: ab initio folding and threading. 
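
The sequence-identity criterion can be made concrete with a short calculation. The sketch below assumes a pairwise alignment is already available as two equal-length gapped strings and counts identity over aligned (non-gap) columns only; other denominators are also in common use, so this convention is an assumption. The 30% default simply restates the approximate threshold quoted above, and the function names are hypothetical.

```python
def percent_identity(aligned_a: str, aligned_b: str) -> float:
    """Percent identity over columns where neither aligned sequence has a gap."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    pairs = [(x, y) for x, y in zip(aligned_a, aligned_b) if x != "-" and y != "-"]
    if not pairs:
        return 0.0
    matches = sum(1 for x, y in pairs if x == y)
    return 100.0 * matches / len(pairs)

def suggest_method(identity: float, threshold: float = 30.0) -> str:
    """Pick a structure-prediction route from identity to a known structure."""
    if identity > threshold:
        return "homology modeling"
    return "threading or ab initio folding"
```

In practice the threshold is approximate rather than sharp, so values near it would call for judgment rather than a mechanical choice.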

In ab initio folding, one starts from a random conformation 
and then attempts to assemble the native structure. 
As this method does not rely on a library of 
pre-existing folds, it can be used to predict novel 
folds. The recent CASP3 protein-structure-prediction 
experiment (http://PredictionCenter.llnl.gov/CASP3) 
involved the blind prediction of the structure of proteins 
whose actual structure was about to be experimentally 
determined. These results indicate that considerable 
progress has been made. For helical and 
α/β proteins with less than 110 residues, structures 
were often predicted whose backbone root-mean-square 
deviation (RMSD) from native ranged from 
4-7 Å. Progress is being made with the β proteins, too, 
although they remain problematic. Because ab initio 
methods can identify novel folds, these methods could 
be used to help to select sequences likely to yield novel 
folds in experimental structural-genomics projects. 

Another approach to tertiary-structure prediction is 
threading. Here, for the sequence of interest, one 
attempts to find the closest matching structure in a 
library of known folds. Threading is applicable to 
proteins of up to 500 residues or so and is much faster 
than ab initio approaches. However, threading cannot 
be used to obtain novel folds. 

Ab initio predicted models can be used for automatic 
protein-function prediction 

The results of the recent CASP3 competition suggest 
that current modeling methods can often (but not 
always) create inexact protein models. Are these structures 
useful for identifying functional sites in proteins? 
Using the ab initio structure-prediction program 
MONSSTER, the tertiary structure of a glutaredoxin, 
1ego, was predicted. For the lowest-energy model, 
the overall backbone RMSD from the crystal structure 
was 5.7 Å. 
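
For readers unfamiliar with the measure, the RMSD between a model and a reference structure is the square root of the mean squared distance between equivalent atoms. The sketch below assumes the two coordinate sets have already been optimally superimposed and list the same backbone atoms in the same order; it performs no superposition itself, and the function name is hypothetical.

```python
import math

def rmsd(model_coords, reference_coords):
    """Root-mean-square deviation between two pre-superimposed coordinate lists."""
    if len(model_coords) != len(reference_coords) or not model_coords:
        raise ValueError("coordinate lists must be non-empty and of equal length")
    squared = sum(
        math.dist(p, q) ** 2 for p, q in zip(model_coords, reference_coords)
    )
    return math.sqrt(squared / len(model_coords))
```

Because squared distances are averaged, a few badly placed regions such as loops can dominate the value even when the core of the model is accurate.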

To determine whether this inexact model could be 
used for function identification, the sets of correctly 
and incorrectly folded structures were screened with 
the FFF for disulfide-oxidoreductase activity. The 
FFF uniquely identified the active site in the correctly 
folded structure but not in the incorrectly folded ones 
(Fig. 2). This is a proof-of-principle demonstration that 
inexact models produced by ab initio prediction of 
structure from sequence can be used for the subsequent 
prediction of biochemical function. Of course, improvements 
in the method have to be made before such 
predictions can be done on a routine basis. 

Use of predicted structures from threading in 
protein-function prediction 

At present, practical limitations preclude folding an 
entire genome of proteins using ab initio methods. 
Threading is more appropriate for achieving the requisite 
high-throughput structure prediction. Thus, a standard 
threading algorithm has been used to screen all 
proteins in nine genomes for the disulfide-oxidoreductase 
active site described above. 

First, sequences that aligned with the structures of 
known disulfide oxidoreductases were identified. Then, 
the structure was searched for matches to the active-site 
residues and geometry. For those sequences for 
which other homologs were available, a sequence-conservation 
profile was constructed. If the putative 
active-site residues were not conserved in the sequence 
subfamily to which the protein belongs, that sequence 
was eliminated. Otherwise, the sequence was predicted 
to have the function. 
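
The three-step screen described above can be sketched as follows. This is a simplified illustration, not the authors' implementation: the conservation test is reduced to a per-column residue fraction, the 0.8 cutoff is an arbitrary assumption, and all function names are hypothetical.

```python
def site_is_conserved(aligned_family, site_columns, expected_residues, min_fraction=0.8):
    """Check that each putative active-site column is dominated by the expected residue."""
    if not aligned_family:
        raise ValueError("at least one aligned homolog is required")
    for column, expected in zip(site_columns, expected_residues):
        matches = sum(1 for seq in aligned_family if seq[column] == expected)
        if matches / len(aligned_family) < min_fraction:  # cutoff is an assumption
            return False
    return True

def predict_has_function(aligns_to_known_fold, geometry_matches, conserved=None):
    """Combine the filters; conserved=None means no homologs were available."""
    if not (aligns_to_known_fold and geometry_matches):
        return False
    # The conservation filter only eliminates a sequence when it was actually run.
    return conserved is not False
```

Treating the conservation check as optional mirrors the text, where a profile is built only for sequences that have other homologs available.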

Using this sequence-to-structure-to-function method, 
99% of the proteins in the nine genomes that have 
known disulfide-oxidoreductase activity have been 
found. From 10% to 30% more functional predictions 
are made than by alternative sequence-based approaches; 
similar results are seen for the α/β hydrolases. Surprisingly, 
in spite of the fact that threading algorithms 
have problems generating good sequence-to-structure 
alignments, active sites are often accurately aligned, 
even for very distant matches. This observation would 
agree with the above experimental results indicating 
that active sites are well conserved in protein structures. 

Importantly, the false-positive rate when using structural 
information is much lower than that found using 
sequence-based approaches, as demonstrated by a 
detailed comparison of the FFF structural approach and 
the Blocks sequence-motif approach (N. Siew et al., 
unpublished). In this study, the sequences in eight 
genomes, including Bacillus subtilis, were analysed for 
disulfide-oxidoreductase function using the disulfide-oxidoreductase 
FFF, the thioredoxin Block 00194 and 
the glutaredoxin Block 00195. If we assume that those 
sequences identified by both the FFF and Blocks 
are 'true positives', we find 13 such sequences in the 
B. subtilis genome. 

There is no experimental evidence validating all of 
these 'true positives' and so they are more accurately 
termed 'consensus positives'. In order to find these 13 
'consensus positive' sequences, the FFF hits seven false 
positives. On the other hand, Blocks hits 23 false 
positives (Fig. 3). It was previously suggested that the 
use of a functional requirement adds information to 
threading and reduces the number of false positives. 
These data, including the data shown in Fig. 3, validate 
this claim on a genome-wide basis. 
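
The comparison can be restated as the fraction of each method's hits that are consensus positives. The short calculation below uses only the counts quoted above for B. subtilis; because consensus positives are not experimentally verified, the resulting fractions are a proxy for precision rather than a true measurement.

```python
def consensus_fraction(consensus_positives: int, false_positives: int) -> float:
    """Fraction of a method's hits that are consensus positives."""
    return consensus_positives / (consensus_positives + false_positives)

# Counts quoted in the text for the B. subtilis genome.
fff_fraction = consensus_fraction(13, 7)      # 13 of 20 hits
blocks_fraction = consensus_fraction(13, 23)  # 13 of 36 hits
```

On these numbers roughly 65% of the FFF hits are consensus positives, against roughly 36% of the Blocks hits.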

Of course, as no genome has had the function of all 
of its proteins experimentally annotated, it is impossible 
to know how many other proteins with the specified 
biochemical function were not properly identified. 
This is a critical question for researchers attempting to 
predict protein function. Experimental confirmation 
will be needed to validate this or any other method 
fully. This points out the need for closely coupling 
computational function-prediction algorithms with 
experiments. 

Weaknesses of using the sequence-to-structure- 
to-function method of function prediction 

Based on studies to date, the identification of enzymatic 
activity requires a model in which the backbone 
RMSD from native near the active sites is about 4-5 Å. 
Predicted models are better at describing the geometry 
in the core of the molecule than in the loops and so 

Figure 1 

The distribution of root-mean-square deviations (RMSD) between the hydrolase 
catalytic triad and all other histidine-containing triads shows a bimodal distribution 
(a); by contrast, the RMSD between a randomly selected (non-catalytic) triad and all 
other histidine-containing triads has a unimodal distribution (b). The His-Ser-Asp 
catalytic triad in the protein 1gpl (RP2 lipase) (a) and a random histidine-containing 
triad from 4pga (glutaminase-asparaginase) (b) were structurally aligned to all His-containing 
triads in a database of 1037 proteins (Ref. 23). Actual α/β-hydrolase active sites 
(a) and the 4pga site (b) are indicated by blue bars; other histidine triads that are 
not active sites are indicated by red bars. None of the sites found by matching to the 
4pga were hydrolase active sites. Inset graphs show the full distribution. 

predicting the function of a protein whose active site is 
in loops may be a problem. Also, the method can currently 
only be applied to enzyme active sites; substrate- 
and ligand-binding sites have not been identified using 
the inexact models. Techniques that will further refine 
inexact protein models will be quite useful in taking 
the protein analysis to the next step. 

Conclusions 

Although sequence-based approaches to protein-function 
prediction have proved to be very useful, alternatives 
are needed to assign the biochemical function 
of the 30-50% of proteins whose function cannot be 
assigned by any current methods. One emerging 
approach involves the sequence-to-structure-to-function 
paradigm. Such structures might be provided by structural-genomics 
projects or by structure-prediction 
algorithms. Functional assignment is made by screening 
the resulting structure against a library of structural 
descriptors for known active sites or binding regions. 